Abstract

We present a theoretical analysis of Maxi- 
mum a Posteriori (MAP) sequence estima- 
tion for binary symmetric hidden Markov 
processes. We reduce the MAP estimation 
to the energy minimization of an appropri- 
ately defined Ising spin model, and focus on 
the performance of MAP as characterized by 
its accuracy and the number of solutions cor- 
responding to a typical observed sequence. It 
is shown that for a finite range of sufficiently 
low noise levels, the solution is uniquely re- 
lated to the observed sequence, while the ac- 
curacy degrades linearly with increasing the 
noise strength. For intermediate noise values, 
the accuracy is nearly noise-independent, but 
now there are exponentially many solutions to 
the estimation problem, which is reflected in 
non-zero ground-state entropy for the Ising 
model. Finally, for even larger noise intensi- 
ties, the number of solutions reduces again, 
but the accuracy is poor. It is shown that 
these regimes are different thermodynamic 
phases of the Ising model that are related to 
each other via first-order phase transitions. 



1 Introduction 

Hidden Markov Models (HMM) are used extensively 
for modeling sequential data in various areas [9], [4]:
information theory, signal processing, bioinformat- 
ics, mathematical economics, linguistics, etc. One of 
the main problems underlying many applications of 
HMMs amounts to inferring the hidden state sequence 
x based on a noise-corrupted observation sequence y.
This is often done through the maximum a posteriori
(MAP) approach, which finds an estimate $\hat{x}(y)$ by
maximizing the posterior probability Pr(x|y).



The computational solution to the MAP optimiza- 
tion problem is readily available via the Viterbi algo- 
rithm [3]. Despite its extensive use in many applica- 
tions, however, the properties of MAP estimation, and 
specifically, the structure of its solution space, have re- 
ceived surprisingly little attention. On the other hand, 
it is clear that choosing a single state sequence might 
be insufficient for adequately understanding the struc- 
ture of the inferred process. To get a more complete 
picture, one needs to know whether there are other 
nearly optimal sequences, how many of them, how they 
compare with the optimal solution, and so on. 

Generally, the structure of an inference method can be characterized by the accuracy of the estimation, and the number $\mathcal{N}(y)$ of solutions $\hat{x}(y)$ that the method can produce in response to a given sequence y. In this paper we study the structure of MAP inference for the simplest binary, symmetric HMM. As an accuracy measure we employ the moments of the estimated sequence $\hat{x}$ in comparison with those of the actual sequence x, while the number $\mathcal{N}(y)$ of possible estimates will be characterized by its averaged logarithm $\sum_y \Pr(y) \ln \mathcal{N}(y)$. The binary symmetric HMM is studied by reducing it to the Ising model in random fields, a relation well-known both in computer science [6] and statistical physics [8]. In this way, the average cost $-\sum_y \Pr(y) \ln \Pr(\hat{x}(y)|y)$ of MAP and the logarithm of the number of solutions $\sum_y \Pr(y) \ln \mathcal{N}(y)$ relate, respectively, to the energy and the entropy of the Ising model at zero temperature.

Our results indicate that even for a simple process such 
as binary symmetric HMM, MAP yields a very rich 
and non-trivial solution structure. The main findings 
can be summarized as follows: For a small, but finite 
range of noise values the MAP solution is uniquely 
related to the observed sequence, and the accuracy of 
the solution degrades linearly with increasing the noise 
strength. For intermediate values of noise the accuracy 
is nearly noise-independent, but now there are expo- 
nentially many solutions to the estimation problem, 



which is reflected in non-zero ground-state entropy 
for the Ising model. Finally, for larger noise inten- 
sities the number of solutions is reduced again, but 
the accuracy is poor. Furthermore, those regimes are 
the manifestation of different thermodynamic phases 
of the Ising model, which are related to each other via 
first-order phase transitions. 

The rest of the paper is organized as follows: After some general discussion of the MAP scheme in Section 2, we define the model studied here in Section 3. Its solution is given in Sections 4 and 5. The latter also discusses our concrete findings on the structure of MAP for the binary symmetric HMM. We conclude the paper with a discussion of our results and future work.

2 Maximum a posteriori (MAP) 
estimation: general description 

Let $x = (x_1, \ldots, x_N)$ and $y = (y_1, \ldots, y_N)$ be realizations of discrete-time random processes $\mathcal{X}$ and $\mathcal{Y}$, respectively. We write their probabilities as p(x) and p(y). We assume that $\mathcal{Y}$ is the noisy observation of $\mathcal{X}$; the influence of noise is described by the conditional probability p(y|x). Let us further assume that we are given an observed sequence y, and we know the probabilities p(y|x) and p(x). We do not know which specific sequence x generated the observation y. MAP offers a method for estimating the generating sequence $\hat{x}(y)$ on the ground of y: $\hat{x}$ is found by maximizing over x the posterior probability $p(x|y) = p(y|x)p(x)/p(y)$. Since p(y) does not depend on x, we can equally well minimize

$$ -\ln[\,p(y|x)\,p(x)\,] \equiv H(y,x). \tag{1} $$

The advantage of using H(y, x) is that if $\mathcal{Y}$ is ergodic (in the sense of the weak law of large numbers) [5], which we assume from now on, then for $N \gg 1$, $H(y, \hat{x}(y))$ will be independent of y, if y belongs to the typical set $\Omega_N(\mathcal{Y})$ [2]. The typical set has the overall probability converging to one: $\sum_{y \in \Omega_N(\mathcal{Y})} p(y) \to 1$. Since all elements of $\Omega_N(\mathcal{Y})$ have (nearly) equal probability, we can employ with probability one the averaged quantity $\sum_y p(y) H(y, \hat{x}(y))$ instead of $H(y, \hat{x}(y))$.

If the noise is very weak, $p(x|y) \simeq \delta(x - y) = \prod_{k=1}^{N} \delta(x_k - y_k)$ (with $\delta(x)$ being the Kronecker delta), we recover the generating sequence almost exactly. For strong noise the estimation is dominated by the prior distribution, $p(x|y) \simeq p(x)$, so that the estimation is not informative. When no priors are put, $p(x) \propto \mathrm{const}$, the MAP estimation reduces to the Maximum Likelihood (ML) estimation scheme. The latter also reproduces the source sequence almost exactly if the noise is weak.



According to the Viterbi algorithm, for a given y the minimization of H(y, x) in (1) produces one single estimate $\hat{x}(y)$. However, it is possible that there are other sequences $\hat{x}^{[\alpha]}(y)$ for which $H(y, \hat{x}^{[\alpha]}(y))$, though greater than $H(y, \hat{x}(y))$, is almost equal to the latter in the sense of $\lim_{N\to\infty} H(y, \hat{x}^{[\alpha]}(y))/N = \lim_{N\to\infty} H(y, \hat{x}(y))/N$. All such sequences are equivalent for $N \to \infty$ and we list them as possible solutions:

$$ \hat{x}^{[\alpha]}(y), \qquad \alpha = 1, \ldots, \mathcal{N}(y). \tag{2} $$

If $\ln \mathcal{N}(y) \propto N$, we repeat the above ergodicity argument and get for the logarithm of the number of solutions corresponding to a typical observed sequence

$$ \Theta = \sum_y p(y) \ln \mathcal{N}(y). \tag{3} $$

A finite $\theta \equiv \Theta/N$ means that there are exponentially many outcomes of minimizing H(y, x) over x. We call $\Theta$ entropy, since it relates to the entropy of the Ising model; see below.

We can calculate various moments of $\hat{x}^{[\alpha]}(y)$, which are random variables due to the dependence on y, and employ them for characterizing the accuracy of the estimation; see below for examples. For small noise values these moments will be close to those of the original process $\mathcal{X}$. Another useful quantity is the average overlap between the estimated sequences $\hat{x}^{[\alpha]}(y)$ and the observed sequence y (the definition of overlap is clarified below). A small overlap means that the estimation is not dominated by observations.

3 Binary symmetric hidden Markov
model (HMM).

3.1 Definition. 

We consider the MAP estimation of a binary, discrete-time Markov stochastic process $\mathcal{X} = (\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_N)$. Each random variable $\mathcal{X}_k$ has only two realizations $x_k = \pm 1$. The Markov feature implies

$$ p(x) = \prod_{k=2}^{N} p(x_k|x_{k-1})\, p(x_1), \tag{4} $$

where $p(x_k|x_{k-1})$ is a time-independent transition probability of the Markov process. For the considered binary symmetric situation it is parameterized by a single number $0 < q < 1$: $p(1|1) = p(-1|-1) = 1 - q$, $p(1|-1) = p(-1|1) = q$, and the stationary distribution is $p_{\rm st}(1) = p_{\rm st}(-1) = 1/2$. Furthermore, the noise process is assumed to be memory-less, time-independent and unbiased:

$$ p(y|x) = \prod_{k=1}^{N} \pi(y_k|x_k), \qquad y_k = \pm 1, \tag{5} $$



where $\pi(-1|1) = \pi(1|-1) = \epsilon$, $\pi(1|1) = \pi(-1|-1) = 1 - \epsilon$, and $\epsilon$ is the probability of error. Here memory-less refers to the factorization in (5), time-independence refers to the fact that in (5) $\pi(\ldots|\ldots)$ does not depend on k, while unbiased means that the noise acts symmetrically on both realizations of the Markov process: $\pi(1|-1) = \pi(-1|1)$.

Note that the composite process $\mathcal{XY}$ with realizations $(y_k, x_k)$ is Markov with transition probabilities

$$ p(y_{k+1}, x_{k+1}|y_k, x_k) = \pi(y_{k+1}|x_{k+1})\, p(x_{k+1}|x_k). \tag{6} $$

However, $\mathcal{Y}$ is in general not a Markov process.
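To make the model of this section concrete, the following minimal Python sketch samples a pair (x, y) from the binary symmetric HMM and computes one MAP estimate with the Viterbi algorithm in the log domain. The function names `sample_hmm`, `log_joint` and `viterbi_map` are illustrative choices, not notation from the paper, and the tie-breaking rule (whichever maximizer `max` returns first) is an assumption, so only one of possibly many degenerate solutions is returned.

```python
import math
import random


def sample_hmm(n, q, eps, rng):
    """Draw (x, y): x flips sign with probability q, y_k differs from x_k with probability eps."""
    x = [rng.choice((-1, 1))]  # stationary distribution is uniform
    for _ in range(n - 1):
        x.append(-x[-1] if rng.random() < q else x[-1])
    y = [-xk if rng.random() < eps else xk for xk in x]
    return x, y


def log_joint(x, y, q, eps):
    """ln[p(y|x) p(x)] for the model of Eqs. (4)-(5); requires 0 < q, eps < 1."""
    lp = math.log(0.5)
    for k in range(1, len(x)):
        lp += math.log(q if x[k] != x[k - 1] else 1 - q)
    for xk, yk in zip(x, y):
        lp += math.log(eps if xk != yk else 1 - eps)
    return lp


def viterbi_map(y, q, eps):
    """Return one sequence maximizing p(x|y) (ties broken arbitrarily)."""
    states = (-1, 1)

    def emit(s, yk):
        return math.log(1 - eps if s == yk else eps)

    def trans(s, t):
        return math.log(1 - q if s == t else q)

    score = {s: math.log(0.5) + emit(s, y[0]) for s in states}
    back = []
    for yk in y[1:]:
        new, ptr = {}, {}
        for t in states:
            best = max(states, key=lambda s: score[s] + trans(s, t))
            new[t] = score[best] + trans(best, t) + emit(t, yk)
            ptr[t] = best
        score = new
        back.append(ptr)
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):  # backtrack from the last site
        path.append(ptr[path[-1]])
    return path[::-1]
```

For short sequences the result can be compared against exhaustive enumeration of all 2^N sequences, which is a convenient sanity check before running long simulations.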

3.2 Mapping to the Ising model. 

Let us represent the transition probabilities as

$$ p(x_k|x_{k-1}) = \frac{e^{J x_k x_{k-1}}}{2\cosh J}, \qquad J = \frac{1}{2}\ln\frac{1-q}{q}. \tag{7} $$

Likewise, we represent the noise model as

$$ \pi(y_i|x_i) = \frac{e^{h y_i x_i}}{2\cosh h}, \qquad h = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}. \tag{8} $$

We combine (7), (8) to represent the log-likelihood as

$$ H(y,x) = -J\sum_{k=1}^{N} x_k x_{k+1} - h\sum_{k=1}^{N} y_k x_k, \tag{9} $$

where we have omitted an irrelevant additive term. H(y, x) is the Hamiltonian of a one-dimensional (1d) Ising spin model with external random fields $h y_k$ governed by the probability p(y) [11]. The factor J in (9) is the spin-spin interaction constant, uniquely determined from the transition probability q: If $q < 1/2$, the constant J is positive, which refers to the ferromagnetic situation: the spin-spin interaction tends to align the spins. From now on we assume J > 0, h > 0. We note that the main difference between (9) and other random-field Ising models considered in the literature [8, 11] is that in our situation the random fields are not uncorrelated random variables, but display non-Markovian correlations.

3.3 Implementation of MAP

To minimize $\sum_y p(y) H(y, x)$ over x, we introduce a non-zero temperature $T = 1/\beta > 0$, and define the following conditional probability

$$ \rho(x|y) = \frac{e^{-\beta H(y,x)}}{Z(y)}, \tag{10} $$

where Z(y) is the partition function. In the terminology of statistical physics, $\rho(x|y)$ gives the probability distribution of states x for a system with Hamiltonian H(y, x) interacting with a thermal bath at temperature T, and with frozen (i.e., fixed for each site) random fields $h y_k$ [7]. For $T \to 0$, and a given y, the function $e^{-\beta H(y,x)}$ is strongly peaked at those $\hat{x}(y)$ [ground states] which minimize H(y, x). If, however, the limit $T \to 0$ is taken after the limit $N \to \infty$, we get

$$ \rho(x|y) \to \frac{1}{\mathcal{N}(y)} \sum_{\alpha=1}^{\mathcal{N}(y)} \delta[\,x - \hat{x}^{[\alpha]}(y)\,], \tag{11} $$

where $\hat{x}^{[\alpha]}$ and $\mathcal{N}(y)$ were defined in (2). From now on we understand the limit $T \to 0$ in this sense.

The average of H [average energy] in the $T \to 0$ limit will be equal to H(y, x) minimized over x:

$$ \sum_{x,y} p(y)\,\rho(x|y)\,H(y,x) = \sum_y p(y)\, H(y, \hat{x}^{[1]}(y)), \tag{12} $$

where we have used the fact that all ground state configurations $\hat{x}(y)$ have the same energy, $H(y, \hat{x}^{[\alpha]}) = H(y, \hat{x}^{[1]})$, for any $\alpha$.

The average logarithm $\Theta$ of the number of MAP solutions is equal to the zero-temperature entropy

$$ \Theta = -\sum_{x,y} p(y)\,\rho(x|y) \ln \rho(x|y) = \sum_y p(y) \ln \mathcal{N}(y). $$

Let us introduce the free energy:

$$ F(J,h,T) = -T \sum_y p(y) \ln \sum_x e^{-\beta H(y,x;J,h)}, \tag{13} $$

defined with the Ising Hamiltonian (9). The entropy $\Theta$ is expressed via the free energy as [see (3), (12)]:

$$ \Theta = -\partial_T F\,|_{T\to 0}. \tag{14} $$

Furthermore, we define the following relevant characteristics of MAP:

$$ c = \sum_{x,y} p(y)\,\rho(x|y)\, \frac{1}{N}\sum_{k=1}^{N} x_k x_{k+1} = -\frac{1}{N}\,\partial_J F, \qquad v = \sum_{x,y} p(y)\,\rho(x|y)\, \frac{1}{N}\sum_{k=1}^{N} y_k x_k = -\frac{1}{N}\,\partial_h F. \tag{15} $$

Here c accounts for the correlations between neighbouring spins in the estimated sequence, while v measures the overlap between the estimated and the observed sequences (since $x_k, y_k = \pm 1$, the average normalized Hamming distance between the two is $(1 - v)/2$). In the limiting case of very weak noise, when the magnitude h of the random fields is large [see (8)], we have $v \to v_0 = 1$ (observation-dominance), while c is equal to the corresponding value $c_0$ of the Markov process $\mathcal{X}$:

$$ c = c_0 = \sum_{x_1, x_2} x_1 x_2\, p_{\rm st}(x_1)\, p(x_2|x_1) = 1 - 2q. \tag{16} $$

And for very strong noise (the probability of error $\epsilon$ is close to 1/2, which means $h \to 0$), v nullifies, while c goes to the corresponding value calculated over the prior distribution p(x): c = sign(J).
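The mapping of Section 3.2 can be checked numerically on a short chain. The sketch below is a minimal illustration, assuming an open chain of N spins with N-1 bonds (a boundary convention chosen here, not stated in the paper); it verifies that $-\ln[p(y|x)p(x)]$ and the Ising energy differ only by an x-independent constant. The helper names `couplings`, `ising_energy`, `neg_log_joint` and `offset_spread` are hypothetical.

```python
import itertools
import math


def couplings(q, eps):
    """J and h as in Eqs. (7)-(8)."""
    J = 0.5 * math.log((1 - q) / q)
    h = 0.5 * math.log((1 - eps) / eps)
    return J, h


def ising_energy(x, y, J, h):
    """Open-chain version of the Hamiltonian (9)."""
    bond = sum(x[k] * x[k + 1] for k in range(len(x) - 1))
    field = sum(xk * yk for xk, yk in zip(x, y))
    return -J * bond - h * field


def neg_log_joint(x, y, q, eps):
    """-ln[p(y|x) p(x)] computed directly from Eqs. (4)-(5)."""
    lp = math.log(0.5)
    for k in range(1, len(x)):
        lp += math.log(q if x[k] != x[k - 1] else 1 - q)
    for xk, yk in zip(x, y):
        lp += math.log(eps if xk != yk else 1 - eps)
    return -lp


def offset_spread(y, q, eps):
    """Spread of (-ln p) - H over all x; zero means the two differ by a constant."""
    J, h = couplings(q, eps)
    diffs = [
        neg_log_joint(x, y, q, eps) - ising_energy(x, y, J, h)
        for x in itertools.product((-1, 1), repeat=len(y))
    ]
    return max(diffs) - min(diffs)
```

A spread at the level of floating-point rounding confirms that minimizing H(y, x) and maximizing the posterior select the same sequences.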



4 Recursion relation 

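Let us return to the partition function in (10):

$$ Z(y) = \sum_{x_1=\pm 1, \ldots, x_N=\pm 1} e^{-\beta H(y,x)}. $$

We apply to Z(y) the following transformations [1]:

$$ Z(y) = \sum_{x_2 \ldots x_N} e^{\beta J \sum_{k=2}^{N} x_{k+1} x_k + \beta h \sum_{k=2}^{N} y_k x_k} \sum_{x_1=\pm 1} e^{\beta J x_1 x_2 + \beta h y_1 x_1} $$

$$ = \sum_{x_2 \ldots x_N} e^{\beta J \sum_{k=2}^{N} x_{k+1} x_k + \beta h \sum_{k=3}^{N} y_k x_k + \beta \xi_2 x_2 + \beta B(\xi_1)}, $$

where $\xi_2 = h y_2 + A(\xi_1)$, $\xi_1 = h y_1$, and where

$$ A(u) = \frac{1}{2\beta} \ln \frac{\cosh[\beta J + \beta u]}{\cosh[\beta J - \beta u]}, \tag{17} $$

$$ B(u) = \frac{1}{2\beta} \ln \big[ 4 \cosh[\beta J + \beta u] \cosh[\beta J - \beta u] \big]. \tag{18} $$

Thus, once the first spin is excluded, the field acting on the second spin changes from $h y_2$ to $h y_2 + A(\xi_1)$. Note the zero-temperature ($\beta \to \infty$) limits (J > 0)

$$ A(u) = u\,\vartheta(J - |u|) + J\,\vartheta(u - J) - J\,\vartheta(-u - J), \tag{19} $$

$$ B(u) = J\,\vartheta(J - |u|) + u\,\vartheta(u - J) - u\,\vartheta(-u - J), \tag{20} $$

where $\vartheta(x) = 0$ for x < 0 and $\vartheta(x) = 1$ for x > 0.

Repeating the above steps we express the partition function as follows:

$$ Z(y) = e^{\beta \sum_{k=1}^{N} B(\xi_k)}, \tag{21} $$

where $\xi_k$ is obtained from the recursion relation

$$ \xi_k = h y_k + A(\xi_{k-1}), \qquad k = 1, 2, \ldots, N, \qquad \xi_0 = 0. \tag{22} $$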

This is a random recursion relation, since the $y_k$ are random quantities governed by the probability p(y). Depending on the value of $y_k$, $\xi_k$ can take values $h + A(\xi_{k-1})$ or $-h + A(\xi_{k-1})$. Even when $y_k$ assumes a finite number of values, $\xi_k$ from (22) can in principle assume an infinite number of values. Fortunately, for $T \to 0$, due to the special form (19, 20) of A(u) and B(u), the number of values assumed by $\xi_k$ is finite (though it can be large). It is checked by inspection that the values taken by $\xi_k$ are parametrized as $\zeta(n_1, n_2) = n_1 h + n_2 J$, where $n_1$ is a positive or negative integer, while $n_2$ can assume only three values 0, ±1. It can also be seen that the states $\zeta(n_1, 0)$ are not recurrent: once $\xi_k$ takes a value with $n_2 = \pm 1$ (note that there is a finite probability for that), it shall never return to the states $\zeta(n_1, 0)$. In the limit $N \gg 1$, we can completely disregard the states $\zeta(n_1, 0)$.
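The zero-temperature recursion can be tested against exhaustive minimization on a short chain. The following sketch assumes an open chain (N-1 bonds), for which the last spin contributes $|\xi_N|$ rather than $B(\xi_N)$; this boundary treatment is an assumption made here and is immaterial for $N \gg 1$. The names `A0`, `B0`, `fields`, `ground_energy_recursion` and `ground_energy_brute` are illustrative.

```python
import itertools


def A0(u, J):
    """Zero-temperature limit (19): u clipped to the interval [-J, J]."""
    return max(-J, min(J, u))


def B0(u, J):
    """Zero-temperature limit (20): max(J, |u|)."""
    return max(J, abs(u))


def fields(y, J, h):
    """Effective fields xi_k from the recursion (22), starting from xi_0 = 0."""
    xi, prev = [], 0.0
    for yk in y:
        prev = h * yk + A0(prev, J)
        xi.append(prev)
    return xi


def ground_energy_recursion(y, J, h):
    """Minimal energy of the open chain from the recursion."""
    xi = fields(y, J, h)
    return -(sum(B0(u, J) for u in xi[:-1]) + abs(xi[-1]))


def ground_energy_brute(y, J, h):
    """Minimal energy of the open chain by enumerating all 2^N sequences."""
    n = len(y)
    return min(
        -J * sum(x[k] * x[k + 1] for k in range(n - 1))
        - h * sum(xk * yk for xk, yk in zip(x, y))
        for x in itertools.product((-1, 1), repeat=n)
    )
```

Agreement of the two energies for random y illustrates that eliminating spins one at a time reproduces the MAP cost without enumerating sequences.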

Now recall that the process $\mathcal{Y}$ with probabilities p(y) is not Markov. To make it Markov we should enlarge it by adding the random variable z; see (6). Here we write the realizations of this auxiliary Markov process $\mathcal{Z}$ as z, so as not to mix them with those of the original process $\mathcal{X}$. ($\mathcal{X}$ and $\mathcal{Z}$ have identical statistical characteristics, but these are different processes: $\mathcal{Z}$ is employed merely for making the composite process Markov.) Likewise, we make the process with realizations $[\xi, y]$ Markov by enlarging it to $[\xi, y, z]$. Let us denote this composite Markov process by $\mathcal{C}$. Its conditional probabilities read

$$ \omega(\xi, y, z \,|\, \xi', y', z') = p(z|z')\, \pi(y|z)\, \varphi(\xi \,|\, \xi', y), \tag{23} $$

where $p(z|z')$ and $\pi(y|z)$ refer to the Markov process $\mathcal{X}$ and the noise, while $\varphi(\xi|\xi', y)$ takes two values, 0 and 1, depending on whether the corresponding transition is allowed or not by the recursion (22). Now the task is to find all possible values of $\xi$ and then to determine $\varphi(\xi|\xi', y)$. Before turning to this task, we relate the characteristics of the studied MAP estimation to the stationary probabilities $\omega(\xi, y, z)$ of the composite Markov process $\mathcal{C}$. First we get from $\omega(\xi, y, z)$ the stationary probabilities $\omega(\xi)$. Next we return to (21) and to the definition of free energy (13). Since the composite Markov process $\mathcal{C}$ will be seen to be ergodic, the free energy can be written as [1]

$$ -f(J, h) \equiv -F(J, h)/N = \sum_{\xi} \omega(\xi)\, B(\xi), \tag{24} $$

where the summation is taken over all possible [for a given range of (J, h)] values of $\xi$. Once f(J, h) is found, we can apply (15) of Section 3.3.

As for the entropy (14), we get from (13, 21)

$$ F(y) = -\frac{1}{2\beta} \sum_{k=1}^{N} \sum_{s=\pm} \ln \big[ 2\cosh[\beta(\xi_k + sJ)] \big]. \tag{25} $$

In this expression we should now select the terms which survive $T \to 0$ and $\partial_T$:

$$ -\partial_T F(y)\,|_{T\to 0} = \frac{\ln 2}{2} \sum_{k=1}^{N} \delta(\xi_k, \pm J), $$

where $\delta(\cdot)$ is the Kronecker symbol and $\delta(\xi_k, \pm J)$ equals one if $\xi_k = J$ or $\xi_k = -J$. In the limit $N \gg 1$, $\frac{1}{N}\sum_{k=1}^{N} \delta(\xi_k, \pm J)$ should, with probability one (i.e., for the elements of the typical set $\Omega(\mathcal{Y})$), converge to $\omega(\xi = J) + \omega(\xi = -J)$, provided that the composite Markov process is ergodic. We thus get [1]

$$ \theta \equiv \Theta/N = \frac{\ln 2}{2}\, [\,\omega(J) + \omega(-J)\,]. \tag{26} $$

The physical meaning of this formula is that the zero-temperature entropy can be extensive only when the external field $\xi$ acting on the spin has the same energy $\xi x_k = \pm J$ as the spin-spin coupling constant J; see (7). If this is the case, then a macroscopic amount of spins is frustrated, i.e., the factors influencing those spins compensate each other, so that their sign is not predetermined even at zero temperature.
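The entropy formula (26) suggests a simple Monte Carlo estimate: run the zero-temperature recursion (22) along a sampled observation sequence and count how often the effective field equals ±J. The sketch below does this; the function name `entropy_per_site`, the floating-point tolerance `tol` used to detect $|\xi_k| = J$, and the sample length are all assumptions of this illustration rather than part of the analysis.

```python
import math
import random


def entropy_per_site(q, eps, n, seed=0, tol=1e-9):
    """Estimate theta = (ln 2 / 2) * [omega(J) + omega(-J)] from one long sample."""
    rng = random.Random(seed)
    J = 0.5 * math.log((1 - q) / q)
    h = 0.5 * math.log((1 - eps) / eps)
    x = rng.choice((-1, 1))  # hidden state, started from the stationary law
    xi, hits = 0.0, 0
    for _ in range(n):
        y = -x if rng.random() < eps else x   # noisy observation of x
        xi = h * y + max(-J, min(J, xi))      # recursion (22) with A(u) from (19)
        if abs(abs(xi) - J) < tol:            # field exactly compensates the coupling
            hits += 1
        if rng.random() < q:                  # advance the hidden Markov chain
            x = -x
    return 0.5 * math.log(2.0) * hits / n
```

In the regime h > 2J every effective field satisfies $|\xi_k| \ge h - J > J$, so the estimate is exactly zero there, while for J < h < 2J a finite fraction of sites is frustrated and the estimate is positive.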




Figure 1: The transition graph between various states for m = 1; see (27).

Figure 2: Transitions between various states (29) for m = 3; see (27). This is one half of the real transition graph. The second half is obtained from the above one by adding bars to all above symbols: $a \to \bar{a}$, $b \to \bar{b}$; see (29).


4.1 Stationary states of the recursion 

For given J and h define an integer m as

$$ \frac{2J}{m-1} > h > \frac{2J}{m}, \qquad m = 1, 2, \ldots \tag{27} $$

Note that the case h > 2J (and there is no upper limit on h) corresponds to m = 1. One can check that for each integer m the recurrent states $[\xi, y]$ assumed by the recursion (22) can be parametrized as

$$ \{\alpha_i, \beta_i, \bar{\alpha}_i, \bar{\beta}_i\}_{i=1}^{m}, \tag{28} $$

$$ \alpha_i = [(2 - i)h + J,\ 1] \equiv [\alpha_i, 1], \quad \bar{\alpha}_i = [-\alpha_i, -1], \quad \beta_i = [-ih + J,\ -1] \equiv [\beta_i, -1], \quad \bar{\beta}_i = [-\beta_i, 1]. \tag{29} $$

Note the symmetry $\bar{\alpha}_i = -\alpha_i$ and $\bar{\beta}_i = -\beta_i$. The transitions between these states, which via the binary function $\varphi$ determine the transition matrix in (23), are illustrated in Figs. 1 and 2 for m = 1 and m = 3, respectively. The reader can easily generalize the latter graph to an arbitrary m.
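The parametrization of the recurrent states can be explored by brute force: starting from the state with $\xi = h + J$, y = 1, follow the zero-temperature recursion for both values of y until no new $[\xi, y]$ pair appears. The sketch below is one way to do this; `recurrent_states` and `regime` are hypothetical helper names, and the rounding to `digits` decimals used to merge floating-point copies of the same state is an assumption of the illustration.

```python
def regime(J, h):
    """Smallest integer m with h > 2J/m, i.e. the regime index of Eq. (27)."""
    m = 1
    while h <= 2 * J / m:
        m += 1
    return m


def recurrent_states(J, h, digits=9):
    """Set of (xi, y) pairs reachable from the state (h + J, +1) under the recursion (22)."""
    def clip(u):
        return max(-J, min(J, u))  # zero-temperature A(u), Eq. (19)

    start = h + J
    seen = {(round(start, digits), 1)}
    stack = [start]
    while stack:
        xi = stack.pop()
        for y in (-1, 1):
            nxt = h * y + clip(xi)
            key = (round(nxt, digits), y)  # rounded key merges float duplicates
            if key not in seen:
                seen.add(key)
                stack.append(nxt)
    return seen
```

For J = 1 and h = 3, 1.5, 0.8 (regimes m = 1, 2, 3) the search returns 4, 8 and 12 pairs, in line with the 4m states listed in (28).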



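We are now prepared to write down from (23) and Fig. 1 the following transition matrix for the composite Markov process $\mathcal{C}$ with m = 1 (rows and columns ordered as $\alpha_1, \beta_1, \bar{\alpha}_1, \bar{\beta}_1$):

$$ W = \begin{pmatrix} W_{\alpha_1|\alpha_1} & 0 & 0 & W_{\alpha_1|\bar{\beta}_1} \\ W_{\beta_1|\alpha_1} & 0 & 0 & W_{\beta_1|\bar{\beta}_1} \\ 0 & W_{\bar{\alpha}_1|\beta_1} & W_{\bar{\alpha}_1|\bar{\alpha}_1} & 0 \\ 0 & W_{\bar{\beta}_1|\beta_1} & W_{\bar{\beta}_1|\bar{\alpha}_1} & 0 \end{pmatrix}. \tag{30} $$

This is a block matrix composed of 2 × 2 matrices (hence the actual size of W in (30) is 8 × 8): 0 means the 2 × 2 matrix with all its elements equal to 0, and

$$ W_{\alpha_1|\ldots} = W_{\bar{\beta}_1|\ldots} = P, \qquad W_{\beta_1|\ldots} = W_{\bar{\alpha}_1|\ldots} = M, \tag{31} $$

$$ P_{xx'} = \pi(+1|x)\, p(x|x'), \qquad M_{xx'} = \pi(-1|x)\, p(x|x'), \tag{32} $$

where x, x' = ±1. Note that P + M is equal to the transition matrix of the Markov process $\mathcal{X}$. Once the 8 × 1 stationary probability vector w of W is found from Ww = w, we get $\omega(\alpha_1) = w_1 + w_2$, $\omega(\beta_1) = w_3 + w_4$, $\omega(\bar{\alpha}_1) = w_5 + w_6$, and $\omega(\bar{\beta}_1) = w_7 + w_8$.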
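As an illustration of the m = 1 case, the sketch below assembles the 8 × 8 matrix from the 2 × 2 blocks P and M and finds its stationary vector by plain power iteration. The block layout is inferred here from the recursion (22) for h > 2J with the state ordering $\alpha_1, \beta_1, \bar{\alpha}_1, \bar{\beta}_1$, and should be read as an assumption of this sketch; the function name `stationary_m1` and the iteration count are likewise illustrative.

```python
def stationary_m1(q, eps, iters=2000):
    """Stationary probabilities [w(a1), w(b1), w(a1_bar), w(b1_bar)] for m = 1."""
    # Index 0 stands for x = +1, index 1 for x = -1.
    p = [[1 - q, q], [q, 1 - q]]   # p(x|x')
    pi_plus = [1 - eps, eps]       # pi(+1|x)
    pi_minus = [eps, 1 - eps]      # pi(-1|x)
    P = [[pi_plus[a] * p[a][b] for b in range(2)] for a in range(2)]
    M = [[pi_minus[a] * p[a][b] for b in range(2)] for a in range(2)]
    Z = [[0.0, 0.0], [0.0, 0.0]]
    # Rows: target state; columns: source state (column-stochastic matrix).
    blocks = [
        [P, Z, Z, P],   # into alpha_1      (y = +1, previous field positive)
        [M, Z, Z, M],   # into beta_1       (y = -1, previous field positive)
        [Z, M, M, Z],   # into alpha_1_bar  (y = -1, previous field negative)
        [Z, P, P, Z],   # into beta_1_bar   (y = +1, previous field negative)
    ]
    W = [[blocks[i // 2][j // 2][i % 2][j % 2] for j in range(8)] for i in range(8)]
    w = [1.0 / 8] * 8
    for _ in range(iters):  # power iteration towards W w = w
        w = [sum(W[i][j] * w[j] for j in range(8)) for i in range(8)]
    return [w[0] + w[1], w[2] + w[3], w[4] + w[5], w[6] + w[7]]
```

With this layout the probability of $\beta_1$ equals the probability of observing y = +1 followed by y = -1, which evaluates to $q/2 - \epsilon(1-\epsilon)(2q-1)$; the overlap and second moment of the m = 1 regime follow from the four returned numbers.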

For a general m, the matrix W is assembled from the 2 × 2 blocks P and M of (32) together with four sparse m × m matrices L, U, E and S, each of which has a few unit entries; all not indicated elements are equal to zero. The transition matrix for a general m reads [see Figs. 1 and 2]

[Equation (33): W written as a sum of four tensor products of block matrices built from P, M and from E, S, L, U; the explicit matrix entries are illegible in this copy.] (33)

The left matrix of each tensor product is a block matrix; each block consists of one 2 × 2 matrix. The right matrix of each tensor product is also a block matrix; now each block consists of one m × m matrix. The zero m × m matrix is written as 0. The overall size of W is 8m × 8m, since each state in (28) is augmented by two realizations of the hidden Markov process.

Note that going from one value of m to another 
amounts to changing the dimension of the matrices E, 
U , S and L. Since these matrices are sparse, efficient 
numerical algorithms of treating them are available, 
even for larger values of m. 

5 MAP inference 

Let us indicate how the quantities of interest are expressed via the stationary probability $\omega$ of the Markov process $\mathcal{C}$ (obtained from (33)). Recall that since the estimated process is unbiased, we are interested in the second moment c, overlap v and entropy $\theta$. The former two quantities have to be obtained via the free energy. To this end, we trace out the redundant variables in the stationary probability of W to obtain the following probabilities (i = 1, …, m):

$$ \omega_m(\bar{\alpha}_i) = \omega_m(\alpha_i), \qquad \omega_m(\bar{\beta}_i) = \omega_m(\beta_i), \tag{34} $$

where the equalities in (34) are due to the symmetry of the unbiased situation. We add a lower index to relevant quantities (e.g., to the $\omega$'s) to indicate the specific value of m. Recall that, e.g., $\omega_1(\alpha_1)$ and $\omega_2(\alpha_1)$ are in general different quantities, since they belong to different Markov processes $\mathcal{C}_1$ and $\mathcal{C}_2$, respectively.

Due to (34), we shall need only the probabilities $\omega_m(\alpha_k)$ and $\omega_m(\beta_k)$ that normalize to one-half:

$$ \sum_{k=1}^{m} [\,\omega_m(\alpha_k) + \omega_m(\beta_k)\,] = 1/2. \tag{35} $$

The free energy then reads (see (21, 24))

$$ -\frac{f}{2} = \sum_{k=1}^{m} [\,\omega_m(\alpha_k) B(\alpha_k) + \omega_m(\beta_k) B(\beta_k)\,] = h\,[\,\omega_m(\alpha_1) + m\,\omega_m(\beta_m)\,] + J\,\big[\tfrac{1}{2} - 2\omega_m(\beta_m)\big]. \tag{36} $$

Now we make use of the fact that the free energy is a continuous function of its parameters¹, which in our case implies

$$ f_m = f_{m+1} \quad \text{at} \quad h = 2J/m, \qquad m = 1, 2, \ldots \tag{37} $$

This leads from (36) to

$$ \omega_m(\alpha_1) = \omega_{m+1}(\alpha_1) + \omega_{m+1}(\beta_{m+1}). \tag{38} $$

One can confirm (38) from (43, 44, 45). Note that (38) will hold for all values of $\epsilon$ and q, since it does not depend on h and/or J (the formalism holds without requiring any specific relation between h, J and $\epsilon$, q). Combining (36) with (15) from Section 3.3 we obtain for the second moment $c_m$ of the estimated sequence and the overlap $v_m$

$$ c_m = 1 - 4\omega_m(\beta_m), \qquad v_m = 2\omega_m(\alpha_1) + 2m\,\omega_m(\beta_m). \tag{39} $$

As seen from (28, 29), if the relations (27) hold [recall that they are strict inequalities], there are only two realizations, $\alpha_2 = J$ and $\bar{\alpha}_2 = -J$, which, according to (26), contribute to the entropy. Recalling also (34), we get (m = 1, 2, …)

$$ \theta = \omega_m(\alpha_2) \ln 2, \qquad \text{for} \quad \frac{2J}{m-1} > h > \frac{2J}{m}. \tag{40} $$

This relation holds for m = 1, if we assume $\omega_1(\alpha_2) = 0$.

At the transition points $h = 2J/m$ between the various regimes (27), there are more states that contribute to the entropy. The reader can verify that

$$ \theta = [\,\omega_m(\alpha_2) + \omega_m(\beta_m)\,] \ln 2, \qquad \text{if} \quad h = 2J/m. \tag{41} $$

This equation is written down assuming that the value of $\theta$ at $h = 2J/m$ does not depend on whether the latter point is reached as $h \to 2J/m + 0$ or as $h \to 2J/m - 0$. This assumption leads from (41) to:

$$ \omega_m(\alpha_2) + \omega_m(\beta_m) = \omega_{m+1}(\alpha_2) + \omega_{m+1}(\beta_m). \tag{42} $$

This relation has the same origin as the continuity of the free energy.

¹ Outside phase transitions the free energy is smooth, while at the phase-transition points it has to be at least continuous, since, besides being the generating function for calculating various averages, the free energy is also a measure of dynamic stability, and at the phase-transition points both phases are equally stable by definition (see [7] for more details).

5.1 The regime m = 1 or h > 2J.

We deduce for the stationary probabilities from (30)

$$ \omega_1(\alpha_1) = \omega_1(h + J) = \frac{1-q}{2} + \epsilon(1-\epsilon)(2q-1), \tag{43} $$

$$ \omega_1(\beta_1) = \omega_1(-h + J) = \frac{q}{2} - \epsilon(1-\epsilon)(2q-1). \tag{44} $$

This implies from (39) $v_1 = 1$, $c_1 = (1 - 2q)(1 - 2\epsilon)^2$. This is in fact the Maximum Likelihood (ML) regime: the noise is so small (or h is so large) that the estimation is completely governed by the observations: $v_1 = 1$. The second moment $c_1$ of the estimated sequence in this regime is given by the original value $c_0 = 1 - 2q$ (see (16)) times the noise-dependent factor $(1 - 2\epsilon)^2$. The entropy in this regime is zero (see (40)): $\theta_1 = 0$. In this sense the estimation is uniquely determined by observations. We stress that the ML and MAP schemes agree with each other not only for very small, but also for finite noises.

5.2 The regimes m = 2 and m = 3.

For more compact presentation of the probabilities, let us introduce separate notations for the noise strength and the Markov correlator: $g = \epsilon(1 - \epsilon)$, $u = 1 - 2q$, where $0 < g < 1/4$ and $0 < u < 1$; see (16). The probabilities obtained from (33) are written as

[...] (45)

We skip the tedious analytical expressions for $\omega_3$.



The values of c and v deduced from (39, 45) are shown in Figs. 3(a) and 3(b). We compare those values with the results obtained by actually finding the MAP estimate via the Viterbi algorithm, and calculating those quantities directly. It is seen that at the regime change points h = 2J and h = J, v and c experience sudden jumps, or first-order phase transitions. Remarkably, those features are perfectly reproduced in the simulations, as shown in Figures 3(a) and 3(b). For instance, in the ML regime h > 2J (0 < ε < 0.09068), the overlap v = 1, indicating that the estimation is governed solely by observations. At h = 2J it jumps sharply, and then monotonically decreases in the regime 2J > h > J. More generally, v decays, both monotonically and via jumps, towards the prior-dominated value v = 0.

m         4         5         6         7
ε         0.3700    0.3910    0.4100    0.421
θ/ln 2    0.07308   0.06587   0.05925   0.05349

Table 1: Regular values of entropy θ for q = 0.24; see (40). This table continues Fig. 3(c) towards larger values of the noise strength ε.

h         2J        J         2J/3      J/2       2J/5
ε         0.0907    0.2400    0.3598    0.3867    0.4051
θ/ln 2    0.1629    0.1462    0.1220    0.0992    0.0831

Table 2: The special values of entropy θ for q = 0.24.

Consider the second moment c of the estimated sequence shown in Figure 3(b). We see that c is nearly a constant for each given $m \ge 2$. This is the main virtue of the MAP scheme as compared to the ML scheme: While the latter predicts a c that quickly decays with the noise as $c_{\rm ML} = (1 - 2q)(1 - 2\epsilon)^2$ (the dotted line in the plot), the proper MAP value of c is not far from its noise-free value $c_0 = 1 - 2q$, and is nearly a constant for a finite range of noise strength $\epsilon$. This advantage of MAP over ML is due to supporting the estimation process by the priors. Indeed, the values of the overlap indicate that the estimated sequence is not completely driven by the observations, though it is still not very far from them. Upon increasing $\epsilon$ towards its maximal value $\epsilon = 1/2$, c experiences jumps during each regime change. For larger m these jumps are smaller and more frequent, leading c to its prior-dominated value 1.

Now let us focus on the entropy: It naturally nullifies in the ML regime ($\theta_1 = 0$), while in the regime m = 2 the entropy $\theta_2$ is monotonically increasing with $\epsilon$, as shown in Figure 3(c). At h = 2J (the phase transition point, where m changes from 1 to 2) $\theta$ experiences a jump, which is again usual for first-order phase transitions. $\theta$ maximizes at an intermediate value of $\epsilon$, and then decays to zero for $\epsilon \to 1/2$; see Table 1. At this point the present approach reduces to a ferromagnetic 1d Ising model without magnetic fields. This model has a trivial ground-state structure and hence zero entropy. We also note that right at the transition points $h = 2J/m$ the value of $\theta$ is different; see Table 2. The largest value is attained for h = 2J.

Finally, we would like to note that the second moment of the estimated sequence, c, is an indirect measure of accuracy. In practice, one is restricted to use such indirect measures as information about the true sequence might not be available. In Figure 4 we present the average error rate for the MAP estimation, which is given by the normalized Hamming distance between the true and Viterbi-decoded sequences, plotted against the noise intensity. Also shown is the average error rate of ML estimation, which is simply $\epsilon$. For vanishing noise, both MAP and ML yield the same average error. Upon increasing the noise intensity, the MAP estimation error behaves differently depending on the parameter q: For small values of q, MAP is always superior to ML for a wide range of noise intensities. For larger values of q, however, the situation is more complicated: Although both methods perform similarly, there are some differences and crossovers between the two at intermediate noise intensities, as shown in Figure 4.

Figure 3: MAP characteristics versus the noise intensity in the regimes m = 1, 2, 3 for q = 0.24: (a) Overlap v; (b) the second moment c; (c) Entropy θ/ln 2. In (a) and (b) the open squares represent simulation results, obtained by running the Viterbi algorithm and calculating the respective quantities directly. We used sequences of size $10^4$, and averaged the results over 100 random trials.







Figure 4: The average error rate given by the nor- 
malized Hamming distance between the true and the 
estimated sequences. 

6 Discussion 

We theoretically examined Maximum a Posteriori 
(MAP) estimation for hidden Markov sequences, and 
found that MAP yields a non-trivial solution struc- 
ture even for the simple binary and unbiased hid- 
den Markov process considered here. We demon- 
strated that for a finite range of noise intensities, there 
is no difference between MAP and Maximum Likeli- 
hood (ML) estimations, as the solution is observation- 
dominated. While it was expected that the two meth- 
ods agree for a vanishing noise, the fact of their exact 
agreement for a finite range of the noise is non-trivial. 
Furthermore, upon increasing the noise intensity the 
MAP solution switches between different operational 
regimes that are separated by first-order phase tran- 
sitions. In particular, a first-order phase-transition 
separates the regime where MAP and ML agree ex- 
actly. At this transition point the influence of the prior 
information becomes comparable to the influence of 



observations. In the vicinity of the first-order phase- 
transitions the performance of MAP (e.g., characteris- 
tics of the estimated sequence) changes abruptly. This 
means that a small change in the noise intensity may 
lead to a large change in the performance. In other 
words, the phase-transition points should be avoided 
in applications. 

For practical applications of HMM (e.g., in speech recognition, or machine translation) it is not enough to know the single solution that provides the largest posterior probability [5]. At the very least, one should also know how many sequences have posterior probabilities sufficiently close to the optimal one. Motivated by this fact, we studied the number $\mathcal{N}$ of MAP solutions that have (for long sequences, $N \to \infty$) almost equal logarithms of the posterior probability. A finite $\theta = \frac{1}{N} \ln \mathcal{N}$ means that there is an exponential number of solutions with posterior probabilities slightly less than the optimal. We found that $\theta$ is finite whenever MAP differs from ML. We believe that this theoretical result might have practical implications as well. For instance, in applications such as statistical machine translation, one usually considers the top K solutions to the inference problem, and then chooses one according to some heuristics. Our result suggests that one needs to be careful with this practice whenever $\theta$ is non-zero, as one might discard a large number of nearly optimal solutions if K is not chosen sufficiently large.

We also note that our work is directly related to the notion of trackability, which can be intuitively defined as one's ability to (accurately) track certain stochastic processes [3], [10]. In fact, a similar phase transition in the number of solutions was reported by Crespi et al. [3] for so-called weak models, where the entries in the HMP transition and emission matrices are either 0 or 1. For more general stochastic processes, an information-theoretical characterization of trackability was suggested in [10]. Within this approach, the accuracy is characterized by the probability $\Pr[\hat{x} \neq x]$ of the estimated sequence $\hat{x}$ not being equal to the actual one, while the structure of the solution space is described via the number of elements $|\Omega|$ in the (conditional) typical set $\Omega$ of x sequences given an observed sequence y (complexity). Both these quantities relate to the conditional entropy $-\sum_{x,y} \Pr(x, y) \ln \Pr(x|y)$. We note that whereas the accuracy and the complexity measures of [10] deteriorate even for a small (but generic) noise intensity, our approach of defining trackability in terms of the zero-temperature entropy of the Ising Hamiltonian (Equation (9)) suggests that a process can be trackable in the MAP sense even in the presence of moderate noise.

Finally, we would like to note that another interest- 



ing feature of the MAP estimation is that its charac- 
teristics (c and v) change only slightly in between the 
phase-transition points. In contrast to ML estimation, 
which deteriorates (at least linearly) when increasing 
the noise, MAP estimation is stable for a finite range 
of noise intensities. Thus, although MAP estimation 
may be less accurate compared to ML, it can be still 
useful as far as its stability is concerned, provided that 
its range of application is selected carefully. 

There are several directions for further developments. 
First, we intend to obtain analytical results for the av- 
erage error rate to complement the empirical analysis 
presented in Figure 4. Furthermore, one can think of a
semi-supervised MAP estimation, where one has (pos- 
sibly noisy) knowledge about the states of the hidden 
process at particular times. Remarkably, the frame- 
work presented here allows a natural generalization to 
this case. Indeed, one simply needs to modify the Ising 
energy function by adding quenched fields at the cor- 
responding locations in the chain. Finally, it will be 
interesting to generalize the analysis presented here be- 
yond the binary hidden Markov processes considered 
here. In this case, the MAP optimization problem can 
be mapped to a Potts model. We would like to note 
that the behavior observed in the simple binary model 
can be explained by the emergence of a finite frac- 
tion of "frustrated" spins, where the frustration can 
be attributed to two competing tendencies - accom- 
modating observations on one hand, and the hidden 
(Markovian) dynamical model on the other. Since this 
mechanism is rather general, we believe that most fea- 
tures of the MAP scheme uncovered here via an exact 
analysis of the simplest binary model will survive in 
more general situations. 